Levy statistics in coding and non-coding nucleotide sequences

We propose a new method of statistical analysis of nucleotide sequences that yields the true scaling
without requiring any form of de-trending. With the help of artificial sequences that are shown to
be statistically equivalent to the real DNA sequences, we find that power-law correlations are present
in both coding and non-coding sequences, in accordance with the recent work of other authors. We
also provide compelling evidence that these long-range correlations generate Levy statistics in both
types of sequences.

The recent progress in the experimental techniques of
molecular genetics has made available a wealth of genome
data (see, for example, the NCBI's GenBank database of
Ref. [?]). This has triggered great interest in both the
mechanics of folding [?] and the statistical analysis of
DNA sequences. This latter aspect, which is the subject of
the present letter, has been discussed by many authors [?].
These pioneering papers mainly focused on the controversial
issue of whether long-range correlations are a property
shared by both coding and non-coding sequences or
are present only in non-coding sequences. The results of
more recent papers [?] yield the convincing conclusion
that the former condition applies. However, some statistical
aspects of DNA sequences are still obscure, and
it is not yet known to what extent the dynamic approach
to DNA sequences proposed by the authors of Ref. [?]
is a reliable picture for both coding and non-coding
sequences. The later work of Refs. [10] and [11] established
a close connection between long-range correlations and
the emergence of non-Gaussian statistics, confirmed by
Mohanti and Narayana Rao [?]. However, according to
the dynamic approach of Refs. [?, 12], this non-Gaussian
statistics should be Levy, and this aspect has not yet
been assessed with compelling evidence.

In this letter we propose a new technique of statistical
analysis, the Diffusion Entropy (DE) method, and we
show that the joint use of this new technique and of the
Detrended Fluctuation Analysis (DFA), applied to DNA
sequences by the authors of Ref. [?], allows us to:

1) establish the presence of long-range correlations in
coding as well as in non-coding sequences;

2) assess the Levy nature of the resulting non-Gaussian
statistics.

In particular we analyze the two DNA sequences studied
in Ref. [13]. These two sequences are the human
T-cell receptor alpha/delta locus, GenBank name
HUMTCRADCV, a non-coding chromosomal fragment of
M = 97630 bases (composed of less than 10% of coding
regions), and the Escherichia coli K12, GenBank name
ECO110K, a genomic fragment with M = 111401 bases
consisting mostly of coding regions (more than 80%).
We build up a random walk trajectory in the x-space
with the following prescription [?].
The site position t is interpreted as "time". The walker
$x(t) = x(t-1) + \xi(t)$ takes a step up [$\xi(t) = +1$] for each
pyrimidine at position t, and a step down [$\xi(t) = -1$]
for each purine. Thus a DNA sequence becomes equivalent
to a single trajectory, from which we have to derive
many distinct trajectories, as we shall show below. The
basic tenet of many techniques currently used to analyze
time series is the detection of scaling [14],[15]. Scaling is
a property of diffusion processes in which the distribution
keeps the same form at all times, once the space
variable x is related to the time variable t via the key relation:

$$x \propto t^{H}. \qquad (1)$$

Ordinary Brownian motion has a time auto-correlation
function $\Phi_\xi(t)$ equal to zero, except for $\Phi_\xi(0) = 1$, and
is known to yield $H = 1/2$. The detection of $H \neq 1/2$
implies instead the presence of extended correlation, i.e.
a correlation function $\Phi_\xi(t)$ described by a power law,
which, in turn, can be interpreted as a signature of the
complex nature of the observed process. The detection
of the true scaling, however, often involves the adoption
of detrending procedures, since a steady bias hidden in
the data produces effects that might be mistaken for
a striking departure from Brownian diffusion, whereas the
interesting form of scaling must be of purely statistical
nature. In the case of the DNA walk, the different
trajectories of the diffusion process are generated in the
following way. For each time t we can construct $M - t + 1$
trajectories of length t:

$$x_j(t) = \sum_{i=j}^{j+t-1} \xi_i, \qquad j = 1, 2, \ldots, M - t + 1, \qquad (2)$$

where $x_j(t)$ represents the position of trajectory j at
time t. Scaling can be studied by direct evaluation of the
time behavior of the variance of the diffusion process.
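
The construction of the walk and of the overlapping-window trajectories can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`dna_steps`, `window_displacements`, `variance`) and the toy ten-base sequence are introduced here only for the example, and the variance is the plain population variance over the M - t + 1 windows.

```python
from itertools import accumulate

PYRIMIDINES = set("CT")
PURINES = set("AG")


def dna_steps(sequence):
    """Map each base to a step: +1 for a pyrimidine (C, T), -1 for a purine (A, G)."""
    steps = []
    for base in sequence.upper():
        if base in PYRIMIDINES:
            steps.append(1)
        elif base in PURINES:
            steps.append(-1)
        else:
            raise ValueError(f"unexpected symbol: {base!r}")
    return steps


def window_displacements(steps, t):
    """Return x_j(t) for every overlapping window of length t (M - t + 1 values)."""
    # prefix[k] is the sum of the first k steps, so each window sum is a difference.
    prefix = [0] + list(accumulate(steps))
    return [prefix[j + t] - prefix[j] for j in range(len(steps) - t + 1)]


def variance(values):
    """Population variance of the window displacements."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Toy sequence, chosen only for illustration.
steps = dna_steps("CTAGGCTTAC")
displacements = window_displacements(steps, 3)
print(displacements, variance(displacements))
```

Repeating the variance evaluation over a range of window lengths t and fitting its growth on a log-log scale would give the second-moment scaling discussed in the text.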
We note that this choice of trajectories is based on a
window of size t, the left side of which moves from
position j = 1 to position $M - t + 1$. The DFA rests
on a much smaller number of non-overlapping windows,
whose left sides are located at the positions 1, t + 1, 2t + 1,
and so on. For each of these non-overlapping windows the
DFA considers only the difference between the actual
sequence value and a local trend [?]. The DE method
uses, on the contrary, the overlapping windows of Eq. (2).
This method of analysis, shown in action here for the first
time on DNA sequences, is derived from that recently
applied to the analysis of time series of sociological interest
[16], and more details on it are given in Ref. [17]. Here
we limit ourselves to explaining the motivation for the
choice of the overlapping windows of Eq. (2). In addition
to increasing the statistical accuracy of the analysis,
the use of overlapping windows is the same prescription
as that dictated, at least in principle, by the rules for the
calculation of the Kolmogorov-Sinai (KS) entropy [18,19]. The
DE shares with the KS entropy the use of the Shannon entropy
indicator, as we shall see later, and also the same
prescription for converting a single trajectory into a large set
of distinct trajectories. The DE uses these trajectories
to determine the scaling of the diffusion process that is
generated by the spreading of these trajectories. The KS
entropy evaluates instead the rate of entropy increase
associated with this spreading [20]. If this spreading is
independent of biases, the DE determines the scaling associated
with it without requiring de-trending, since the
scaling is determined by the entropy increase and this is
virtually independent of biases.

To evaluate the Shannon entropy of the diffusion process
at time t we partition the x-axis into cells of size
$\epsilon = 1$, and we define S(t) as:

$$S(t) = -\sum_i p_i(t) \ln[p_i(t)], \qquad (4)$$

where $p_i(t)$ is the probability that x can be found in the
i-th cell at time t:

$$p_i(t) = \frac{N_i(t)}{M - t + 1}, \qquad (5)$$

and $N_i(t)$ is the number of trajectories found in the cell
i at a given time t. The connection between S(t) and
scaling becomes evident in the continuous approximation,
where the trajectories of the DNA walk of Eq. (2)
are described by the continuous equation of motion:

$$\frac{dx}{dt} = \xi(t). \qquad (6)$$

Here $\xi(t)$ is the dichotomous variable assuming the values
+1 and -1, and t is thought of as a continuous
time. In this case the Shannon entropy reads
$S(t) = -\int dx\, p(x,t) \ln[p(x,t)]$. We assume:



$$p(x,t) = \frac{1}{t^{\delta(t)}} F\!\left(\frac{x}{t^{\delta(t)}}\right). \qquad (7)$$
This is a generalization of the ordinary scaling assumption,
which is recovered by setting $\delta(t)$ equal to the
time-independent scaling parameter H. For the sake of
simplicity we keep the ordinary assumption of a fixed
form of statistics, expressed by the analytical form of the
coefficient A defined in Eqs. (8) and (9). Using Eq. (7),
after some simple algebra, we get for the entropy:

$$S(t) = A + \delta(t) \ln(t), \qquad (8)$$

where

$$A \equiv -\int dy\, F(y) \ln[F(y)]. \qquad (9)$$
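
The algebra leading to the linear dependence of the entropy on ln(t) can be sketched as follows. The sketch assumes the scaling form $p(x,t) = t^{-\delta(t)} F(x/t^{\delta(t)})$ with F normalized to unity, and uses the change of variable $y = x/t^{\delta(t)}$; it is a reconstruction of a standard step, not a quotation of the authors' derivation.

```latex
\begin{aligned}
S(t) &= -\int dx\, p(x,t)\,\ln[p(x,t)] \\
     &= -\int dx\, \frac{1}{t^{\delta(t)}}\,
        F\!\left(\frac{x}{t^{\delta(t)}}\right)
        \left\{ \ln F\!\left(\frac{x}{t^{\delta(t)}}\right) - \delta(t)\ln(t) \right\} \\
     &= -\int dy\, F(y)\,\ln[F(y)] \;+\; \delta(t)\,\ln(t) \int dy\, F(y) \\
     &= A + \delta(t)\,\ln(t).
\end{aligned}
```

The last step uses only the normalization $\int dy\, F(y) = 1$, which is why the constant A depends on the shape of F but not on time.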
The diffusion entropy is a linear function of the logarithm
of t, with a slope equal to $\delta(t)$, so that measuring the
slope is equivalent to detecting the scaling.
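
A slope measurement of this kind can be sketched in Python as follows. The code is a hedged illustration, not the authors' implementation: the names `diffusion_entropy` and `fit_slope`, the synthetic uncorrelated sequence, and the chosen window lengths are assumptions made for the example. With unit cells and steps of +1 or -1, only every other cell is occupied at fixed t, which shifts the entropy by a constant and should leave the slope unaffected.

```python
import math
import random
from collections import Counter
from itertools import accumulate


def diffusion_entropy(steps, t):
    """Shannon entropy S(t) of the overlapping-window displacements, cells of size 1."""
    prefix = [0] + list(accumulate(steps))
    n_windows = len(steps) - t + 1
    # N_i(t): number of windows whose displacement falls in cell i.
    counts = Counter(prefix[j + t] - prefix[j] for j in range(n_windows))
    return -sum((n / n_windows) * math.log(n / n_windows) for n in counts.values())


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic uncorrelated dichotomous sequence (an assumption for the example).
rng = random.Random(0)
steps = [rng.choice((-1, 1)) for _ in range(100000)]

times = [10, 20, 40, 80]
entropies = [diffusion_entropy(steps, t) for t in times]
delta = fit_slope([math.log(t) for t in times], entropies)
print(f"estimated slope: {delta:.3f}")
```

For an uncorrelated sequence the estimated slope is expected to lie near 0.5; applying the same two functions to a real DNA walk would give the time-dependent slopes discussed in the text.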
Let us now consider the following two possibilities:

1) If $\xi(t)$ is an uncorrelated dichotomous variable, F(y)
has a Gaussian form:

$$F_{\rm Gauss}(y) = \frac{\exp\left(-\frac{y^2}{2\sigma^2}\right)}{\sqrt{2\pi\sigma^2}}, \qquad (10)$$

and then the diffusion entropy of Eq. (8) reads

$$S(t) = \frac{1}{2}\left[1 + \ln(2\pi\sigma^2)\right] + \frac{1}{2}\ln(t). \qquad (11)$$



2) If, instead, $\xi(t)$ has the power-law correlation function
$\Phi_\xi(t) \sim 1/t^{\beta}$, with $0 < \beta < 1$, the distribution
density of sojourn times in one of the two states +1 or
-1 is known [?] to take the form $\psi(t) \sim 1/t^{\mu}$,
with $\mu = \beta + 2$. This implies a divergent second moment
and consequently [?] F(y) takes the form of
a stable Levy distribution, thereby yielding:

$$S(t) = A_{\rm Levy} + \frac{1}{\mu - 1}\ln(t). \qquad (12)$$

In both cases we expect S(t) to be a linear function
of ln(t), with slope $\delta = 0.5$ and $\delta = 1/(\mu - 1)$ in the
uncorrelated and correlated case, respectively. We note
that correlated Gaussian cases also exist [?], where
$\delta = (4 - \mu)/2$.

We are now ready to consider the applications to the
two DNA sequences. In Fig. 1a we show that the DE
analysis of the non-coding sequence HUMTCRADCV results
in a scaling changing with time, and correlated diffusion
shows up at both the short-time and the long-time
scale. This is pointed out by means of two straight
lines of different slopes: the scaling in the short-time
regime, $\delta = 0.615$, coincides exactly with the value found
by means of the DFA [?], while the real asymptotic
scaling is $\delta = 0.565$, corresponding to $\mu = 2.77$ (see
Eq. (12)).
[Figure 1, panel legends. (a) HUMTCRADCV, non-coding: fits y = 0.67 + 0.615 log(x) and y = 0.945 + 0.565 log(x). (b) ECO110K, coding: fits y = 0.67 + 0.52 log(x) and y = -0.01 + 0.67 log(x).]
FIG. 1. The diffusion entropy analysis for the two DNA
sequences results in a scaling changing with time. For
HUMTCRADCV, the non-coding chromosomal fragment, the
slope of the straight line is $\delta = 0.615$ in the short-time regime
and $\delta = 0.565$ in the long-time regime. For ECO110K, the coding
genomic fragment, the slopes are $\delta = 0.52$ in the short-time regime
and $\delta = 0.67$ in the long-time regime.

In Fig. 1b we consider the more delicate problem of a
coding sequence: for ECO110K we observe at short times
a slope $\delta = 0.52$, very close to that of an ordinary random
walk, and at long times a correlated diffusion with
$\delta = 0.67$, corresponding to $\mu = 2.5$. We note that the authors
of Ref. [?], using the DFA, find in the short-time regime
an uncorrelated diffusion with $\delta_V = 0.51$, in agreement
with the DE, and in the long-time regime a scaling
$\delta_V = 0.75$, which apparently conflicts with the finding of the
DE method, yielding $\delta = 0.67$. Note that the symbol
$\delta_V$, with V standing for variance, refers to the scaling
detected by means of the DFA, which is in fact based on
the variance measurement. Actually, we can prove that
this apparent conflict lends strong support to the main
finding of our paper, that the DE method reveals the
long-range correlations and the true asymptotic scaling
of both coding and non-coding sequences.

In order to do so, we model a DNA sequence by adopting
the Copying Mistake Map (CMM) of Ref. [?]. As
pointed out more recently [11], this model is equivalent
to the Generalized Levy Walk (GLW) [?]. The GLW,
in turn, fits very well the observation made by the authors
of Ref. [13] that the transition to super-diffusion in
the long-time region is a manifestation of random walk
patches with bias. The CMM corresponds to a picture
where Nature builds up the real DNA sequence, either
coding or non-coding, by using two different sequences.
The former is a Random Sequence (RS), equivalent to
assigning to any site the value +1 or -1 with equal
probability. The latter sequence, on the contrary, is highly
correlated and is obtained as follows. First of all, a sequence
of integer numbers $l > 0$ is drawn, with the inverse
power law distribution:

$$p(l) \propto \frac{1}{(T + l)^{\mu}}, \qquad 2 < \mu < 3. \qquad (13)$$



Any drawing corresponds to fixing the length of a patch
in a sequence of patches. To any patch is then assigned a sign,
either +1 or -1, by tossing a coin. This prescription is
virtually the same as that adopted to build up the symbolic
sequence of Ref. [22], and corresponds to the intermittent
condition of the Manneville map [23]. We
call this correlated sequence the Intermittent Randomness
Sequence (IRS). As shown in Refs. [12], [11], the diffusion
process generated by the IRS is a Levy diffusion. According
to the CMM, Nature builds up the real DNA
sequence by adopting for any site of the real sequence
the nucleotide occupying the same site in the RS, with
probability $p_R$, or the corresponding one of the IRS, with
probability $p_L = 1 - p_R$. The same prescription is used
for modeling both the coding and non-coding DNA sequences,
the only difference being in $p_R$, i.e. in the relative
weight of the correlated and uncorrelated components: in
particular the condition $p_R \gg p_L$ is valid for the coding
DNA. The Levy diffusion is faster than ordinary diffusion,
and therefore is expected to become predominant,
and so visible, at long times, even when $p_R \gg p_L$. Of
course, upon increase of $p_R$ Levy statistics become
visible at longer and longer times. As shown in Fig. 2,
the DE of HUMTCRADCV and ECO110K is accurately
reproduced by a CMM with $\mu = 2.77$ and $\mu = 2.5$,
respectively. For the coding sequence $p_R = 0.943$, i.e. the
random component is predominant, while for the
non-coding sequence $p_R = 0.560$. It is worth noting that
with such values of $p_R$ the CMM also accounts for the
correct slope of S(t) vs. ln(t) in the short-time regime.
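
The CMM prescription can be sketched in Python as follows. This is a rough illustration under explicit assumptions, not the authors' generator: the function names are hypothetical, and patch lengths are drawn by inverse-transform sampling of the continuous density proportional to $1/(T + l)^{\mu}$ and then rounded up to an integer, which reproduces the power-law tail only approximately.

```python
import random


def sample_patch_length(mu, T, rng):
    """Draw an integer l >= 1 whose tail decays approximately as 1/(T + l)**mu."""
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids u == 0
    # Inverse transform of the survival function (T / (T + l))**(mu - 1).
    continuous = T * (u ** (-1.0 / (mu - 1.0)) - 1.0)
    return int(continuous) + 1


def intermittent_sequence(length, mu, T, rng):
    """Correlated sequence (IRS): patches of constant sign with power-law lengths."""
    seq = []
    while len(seq) < length:
        patch = sample_patch_length(mu, T, rng)
        sign = rng.choice((-1, 1))  # coin toss for the sign of the patch
        seq.extend([sign] * min(patch, length - len(seq)))
    return seq


def cmm_sequence(length, p_random, mu, T, seed=0):
    """Site by site, copy the random sequence with probability p_random, else the IRS."""
    rng = random.Random(seed)
    random_seq = [rng.choice((-1, 1)) for _ in range(length)]
    correlated_seq = intermittent_sequence(length, mu, T, rng)
    return [r if rng.random() < p_random else c
            for r, c in zip(random_seq, correlated_seq)]


# Parameter values quoted in the text for the two sequences.
coding_like = cmm_sequence(10000, p_random=0.943, mu=2.5, T=45)
non_coding_like = cmm_sequence(10000, p_random=0.56, mu=2.77, T=0.43)
print(sum(coding_like), sum(non_coding_like))
```

Feeding such artificial sequences to a diffusion entropy routine is one way to compare the model with the real sequences; whether this sketch matches the published curves depends on sampling details that the text does not specify.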

Finally, we want to illustrate an important property of
the DE method. The DE detects the real scaling $\delta$ of the
distribution, rather than the second-moment scaling $\delta_V$.
The two scaling values are identical only in the Gaussian
case. In the Levy case they are related [?] to one
another by:

$$\delta = \frac{1}{3 - 2\delta_V}. \qquad (14)$$
We see that in the case of the non-coding sequence
the DE yields an asymptotic scaling which is slightly
smaller than the short-time scaling. This corresponds
to the transition from the short-time Gaussian condition
to the long-time Levy condition, namely, to the transition
from $\delta = \delta_V = 0.61$ at short times to the value
$\delta = 1/(\mu - 1) = 0.565$ of the Levy regime, with $\delta$
now related to $\delta_V = 0.61$ by Eq. (14). In the coding
case we see that the scaling detected by the DE method
is $\delta = 0.67$, which again is related to $\delta_V = 0.75$ through
Eq. (14).
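
The numbers quoted above can be checked with a few lines of Python. The sketch assumes that the Levy-regime relation has the form $\delta = 1/(3 - 2\delta_V)$, obtained here by combining $\delta = 1/(\mu - 1)$ with the variance scaling $\delta_V = (4 - \mu)/2$; the function names are introduced only for this example.

```python
def mu_from_delta(delta):
    """Invert delta = 1 / (mu - 1)."""
    return 1.0 + 1.0 / delta


def delta_from_variance_scaling(delta_v):
    """Assumed Levy-regime relation delta = 1 / (3 - 2 * delta_V)."""
    return 1.0 / (3.0 - 2.0 * delta_v)


for label, delta_v in (("non-coding", 0.61), ("coding", 0.75)):
    delta = delta_from_variance_scaling(delta_v)
    print(label, round(delta, 3), round(mu_from_delta(delta), 2))
```

Under this assumption the DFA values 0.61 and 0.75 map to scalings close to the DE values 0.565 and 0.67 reported in the text, and to power-law indices close to 2.77 and 2.5.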



of statistical analysis so accurate as to perceive the dif- 
ference between Levy and Gauss scaling. 




FIG. 2. CMM simulation of the two DNA sequences. 
Fig. 2a shows the comparison between the DE analysis of 
HUMTCRADCV and an artificial sequence corresponding to 
the CMM model with p R = 0.56, T = 0.43, fi = 2.77. Fig. 2b 
shows the comparison between the DE analysis of ECO110K 
and an artificial sequence corresponding to the CMM model 
with p R = 0.943, T = 45, fj, = 2.5. 

In conclusion, this paper affords two important results. 
It proves that the DE method is a very reliable technique 
that detects the real scaling, and the real scaling does 
not coincide in general with that given by the DFA. The 
second result is that the joint use of the DE and DFA 
makes it possible to prove that the CMM, or the GLW, 
which is totally equivalent to the CMM [|ll] , accounts for 
both coding and non-coding sequences. All this strength- 
ens the idea that both non-coding and coding DNA se- 
quences yield in the long-time limit an evident manifesta- 
tion of long-range correlations, and confirms the claims of 
Ref. JTT| , where the non-Gaussian nature of the long-time 
regime was interpreted as a sign of the Levy character of 
this region. The Levy nature of the long-time statistics 
is now made compelling by the use of the DE, a method 
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